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Abstract 

We consider linear models for stochastic dynamics. To any such model can be as- 
sociated a network (namely a directed graph) describing which degrees of freedom 
interact under the dynamics. We tackle the problem of learning such a network 
from observation of the system trajectory over a time interval T. 
We analyze the t\ -regularized least squares algorithm and, in the setting in which 
the underlying network is sparse, we prove performance guarantees that are uni- 
form in the sampling rate as long as this is sufficiently high. This result substan- 
tiates the notion of a well defined 'time complexity' for the network inference 
problem. 

keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity 
and feature selection. 

1 Introduction and main results 

Let G = (V, E) be a directed graph with weight E 1R associated to the directed edge (j, i) from 
j E V to i € V. To each node i E V in this network is associated an independent standard Brownian 
motion 6$ and a variable Xi taking values in R and evolving according to 

dxi(t)= A ij x j(t)dt+ dbi{t), 



+ <■ 



where d+i = {j E V : (j, i) E E} is the set of 'parents' of i. Without loss of generality we shall 
take V — [p] = {1, . . . ,p}. In words, the rate of change of Xi is given by a weighted sum of the 
current values of its neighbors, corrupted by white noise. In matrix notation, the same system is then 
represented by 

dx(t) = A°x(t) dt + db(t) , (1) 

with x(t) E R p , b(t) a p-dimensional standard Brownian motion and A E R pxp a matrix with 
entries {^«}i,je[p] whose sparsity pattern is given by the graph G. We assume that the linear system 
x(t) = A°x(t) is stable (i.e. that the spectrum of A is contained in {z E C : Rc(z) < 0}). Further, 
we assume that x(t = 0) is in its stationary state. More precisely, a;(0) is a Gaussian random variable 
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independent of b(t), distributed according to the invariant measure. Under the stability assumption, 
this a mild restriction, since the system converges exponentially to stationarity. 

A portion of time length T of the system trajectory {a:(i)}te[o,T] is observed and we ask under which 
conditions these data are sufficient to reconstruct the graph G (i.e., the sparsity pattern of A ). We 
are particularly interested in computationally efficient procedures, and in characterizing the scaling 
of the learning time for large networks. Can the network structure be learnt in a time scaling linearly 
with the number of its degrees of freedom? 

As an example application, chemical reactions can be conveniently modeled by systems of non- 
linear stochastic differential equations, whose variables encode the densities of various chemical 
species JTJ [2j . Complex biological networks might involve hundreds of such species [3|, and learn- 
ing stochastic models from data is an important (and challenging) computational task |4|. Consider- 
ing one such chemical reaction network in proximity of an equilibrium point, the model (Q~|i can be 
used to trace fluctuations of the species counts with respect to the equilibrium values. The network 
G would represent in this case the interactions between different chemical factors. Work in this area 
focused so-far on low-dimensional networks, i.e. on methods that are guaranteed to be correct for 
fixed p, as T — > oo, while we will tackle here the regime in which both p and T diverge. 

Before stating our results, it is useful to stress a few important differences with respect to classical 
graphical model learning problems: 

(i) Samples are not independent. This can (and does) increase the sample complexity. 

(ii) On the other hand, infinitely many samples are given as data (in fact a collection indexed 
by the continuous parameter t £ [0, T]). Of course one can select a finite subsample, for 
instance at regularly spaced times {x(i 7?)}i=o,i,.... This raises the question as to whether 
the learning performances depend on the choice of the spacing rj. 

(Hi) In particular, one expects that choosing ?/ sufficiently large as to make the configurations in 
the subsample approximately independent can be harmful. Indeed, the matrix A contains 
more information than the stationary distribution of the above process CO, and only the 
latter can be learned from independent samples. 

(iv) On the other hand, letting rj — > 0, one can produce an arbitrarily large number of distinct 
samples. However, samples become more dependent, and intuitively one expects that there 
is limited information to be harnessed from a given time interval T. 

Our results confirm in a detailed and quantitative way these intuitions. 
1.1 Results: Regularized least squares 

Regularized least squares is an efficient and well-studied method for support recovery, 
discuss relations with existing literature in Section [T31 

In the present case, the algorithm reconstructs independently each row of the matrix A 
row, A®, is estimated by solving the following convex optimization problem for A r € W 

minimize C(A r ;{x(t)} te [ 0>T ]) + X\\A r \\i , (2) 

where the likelihood function C is defined by 

C(A r - {x(t)} te[0 . T] ) = -L JjA* r x(t)) 2 dt-^ J T {A* r x{t)) dx r (t) . (3) 

(Here and below M* denotes the transpose of matrix/vector M.) To see that this likelihood function 
is indeed related to least squares, one cm formally write x r (t) — dx r (t) /dt and complete the square 
for the right hand side of Eq. ©, thus getting the integral J (A*x{t) — x r (t)) 2 dt — J x r {t) 2 dt. 
The first term is a sum of square residuals, and the second is independent of A. Finally the l\ 
regularization term in Eq. (f2j) has the role of shrinking to a subset of the entries Ay thus effectively 
selecting the structure. 

Let S° be the support of row A° r , and assume IS" ! < k. We will refer to the vector sign(A°) as to 
the signed support of A® (where sign(0) = by convention). Let A max (M) and A m j n (A/) stand for 
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the maximum and minimum eigenvalue of a square matrix M respectively. Further, denote by A m i n 
the smallest absolute value among the non-zero entries of row 

When stable, the diffusion process (Q~|) has a unique stationary measure which is Gaussian with 
covariance Q° £ W xp given by the solution of Lyapunov's equation |5| 

A°Q° + Q°(A )* +1 = 0. (4) 

Our guarantee for regularized least squares is stated in terms of two properties of the covariance Q° 
and one assumption on p m i n (A°) (given a matrix M, we denote by Ml,b its submatrix Ml.r = 

(a) We denote by C m i n = A m i n (Q^o go) the minimum eigenvalue of the restriction of Q° to 
the support S° and assume C m ; n > 0. 

(b) We define the incoherence parameter a by letting lQ rs ) a ,s (Q°s°,s°) * II oo = 1 — a, 
and assume a > 0. (Here ||| • loo is the operator sup norm.) 

(c) We define p m i n (^4°) = -A max ((A° + A°*)/2) and assume p m - m (A°) > 0. Note this is a 
stronger form of stability assumption. 

Our main result is to show that there exists a well defined time complexity, i.e. a minimum time 
interval T such that, observing the system for time T enables us to reconstruct the network with 
high probability. This result is stated in the following theorem. 

Theorem 1.1. Consider the problem of learning the support S° of row A® of the matrix A from a 
sample trajectory {^(i)}te[o,Tl distributed according to the model (Q. // 

10 4 fc 2 (fc Pmin (A")- 2 + A m f n ) /jgfcx 

then there exists A such that i\-regularized least squares recovers the signed support of A® with 
probability larger than 1 — S. This is achieved by taking A = -\/36 \og(4 : p/S)/(Ta' 2 p m i n (A )) . 

The time complexity is logarithmic in the number of variables and polynomial in the support size. 
Further, it is roughly inversely proportional to p m i n (A°), which is quite satisfying conceptually, 
since /0 m i n (^4 ) _1 controls the relaxation time of the mixes. 



1.2 Overview of other results 



So far we focused on continuous-time dynamics. While, this is useful in order to obtain elegant state- 
ments, much of the paper is in fact devoted to the analysis of the following discrete-time dynamics, 
with parameter r] > 0: 

x(t) =x(t-l) + r]A x(t~l)+ w(t), telNo- (6) 
Here x(t) £ W is the vector collecting the dynamical variables, A £ R pxp specifies the dynamics 
as above, and {w(t)}t>o is a sequence of i.i.d. normal vectors with covariance r/ I pxp (i.e. with 
independent components of variance rj). We assume that consecutive samples {x(t)}o< t <n are 
given and will ask under which conditions regularized least squares reconstructs the support of A . 

The parameter r\ has the meaning of a time-step size. The continuous-time model (Q~|i is recovered, 
in a sense made precise below, by letting 77 — ^ 0. Indeed we will prove reconstruction guarantees 
that are uniform in this limit as long as the product nr\ (which corresponds to the time interval T in 
the previous section) is kept constant. For a formal statement we refer to Theorem l3.ll Theorem ll.il 
is indeed proved by carefully controlling this limit. The mathematical challenge in this problem is 
related to the fundamental fact that the samples {x(t)}o< t <n are dependent (and strongly dependent 
as 77 —> 0). 

Discrete time models of the form (|6]l can arise either because the system under study evolves by 
discrete steps, or because we are subsampling a continuous time system modeled as in Eq. ((T). 
Notice that in the latter case the matrices A° appearing in Eq. (0 and (fl]i coincide only to the zeroth 
order in ?y. Neglecting this technical complication, the uniformity of our reconstruction guarantees 
as rj — > has an appealing interpretation already mentioned above. Whenever the samples spacing 
is not too large, the time complexity (i.e. the product nrj) is roughly independent of the spacing 
itself. 
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1.3 Related work 



A substantial amount of work has been devoted to the analysis of i\ regularized least squares, and 
its variants [6 7,8 9 1 1 . The most closely related results are the one concerning high-dimensional 
consistency for support recovery lfTTl[T2l . Our proof follows indeed the line of work developed in 
these papers, with two important challenges. First, the design matrix is in our case produced by 
a stochastic diffusion, and it does not necessarily satisfies the irrepresentability conditions used by 
these works. Second, the observations are not corrupted by i.i.d. noise (since successive configura- 
tions are correlated) and therefore elementary concentration inequalities are not sufficient. 

Learning sparse graphical models via l\ regularization is also a topic with significant literature. In 
the Gaussian case, the graphical LASSO was proposed to reconstruct the model from i.i.d. samples 
ifTJl . In the context of binary pairwise graphical models, Ref. [ 1 1 1 proves high-dimensional con- 
sistency of regularized logistic regression for structural learning, under a suitable irrepresentability 
conditions on a modified covariance. Also this paper focuses on i.i.d. samples. 

Most of these proofs builds on the technique of 11121 . A naive adaptation to the present case allows 
to prove some performance guarantee for the discrete-time setting. However the resulting bounds 
are not uniform as r\ — > for nr\ — T fixed. In particular, they do not allow to prove an analogous 
of our continuous time result, Theorem ll.il A large part of our effort is devoted to producing more 
accurate probability estimates that capture the correct scaling for small r\. 

Similar issues were explored in the study of stochastic differential equations, whereby one is often 
interested in tracking some slow degrees of freedom while 'averaging out' the fast ones fl4*l . The 
relevance of this time-scale separation for learning was addressed in lfl5l . Let us however emphasize 
that these works focus once more on system with a fixed (small) number of dimensions p. 

Finally, the related topic of learning graphical models for autoregressive processes was studied re- 
cently in ifTBIfTTl . The convex relaxation proposed in these papers is different from the one devel- 
oped here. Further, no model selection guarantee was proved in ifTBIfTTIl . 

2 Illustration of the main results 

It might be difficult to get a clear intuition ofTheorem ll.il mainly because of conditions (a) and (6), 
which introduce parameters C m - ln and a. The same difficulty arises with analogous results on the 
high-dimensional consistency of the LASSO ITTl[T2l . In this section we provide concrete illustration 
both via numerical simulations, and by checking the condition on specific classes of graphs. 

2.1 Learning the laplacian of graphs with bounded degree 

Given a simple graph Q — (V, £) on vertex set V = [p], its laplacian A s is the symmetric p x p 
matrix which is equal to the adjacency matrix of Q outside the diagonal, and with entries A? = 
— deg(i) on the diagonal [ 18 1. (Here deg(i) denotes the degree of vertex i.) 

It is well known that A e is negative semidefinite, with one eigenvalue equal to 0, whose multiplicity 
is equal to the number of connected components of Q. The matrix A = —to I + A s fits into 
the setting of Theorem 11.11 for m > 0. The corresponding model dl.lt describes the over-damped 
dynamics of a network of masses connected by springs of unit strength, and connected by a spring 
of strength to to the origin. We obtain the following result. 

Theorem 2.1. Let Q be a simple connected graph of maximum vertex degree k and consider the 
model f li.il ) with A° = —to I + A^ where ISP is the laplacian ofQ and m > 0. If 



then there exists A such that l\-regularized least squares recovers the signed support of A®, with 
probability larger than 1 — <5. This is achieved by taking A = -\/36(fc + to) 2 \og(Ap/ 5) / (Tm 3 ). 

In other words, for to bounded away from and oo, regularized least squares regression correctly 
reconstructs the graph Q from a trajectory of time length which is polynomial in the degree and 
logarithmic in the system size. Notice that once the graph is known, the laplacian A e is uniquely 
determined. Also, the proof technique used for this example is generalizable to other graphs as well. 




(7) 
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Figure 1: (left) Probability of success vs. length of the observation interval nrj. (right) Sample 
complexity for 90% probability of success vs. p. 

2.2 Numerical illustrations 

In this section we present numerical validation of the proposed method on synthetic data. The results 
confirm our observations in Theorems 11.11 and 13.11 below, namely that the time complexity scales 
logarithmically with the number of nodes in the network p, given a constant maximum degree. 
Also, the time complexity is roughly independent of the sampling rate. In Fig. Q] and [2] we consider 
the discrete-time setting, generating data as follows. We draw A as a random sparse matrix in 
{0, 1}p x p with elements chosen independently at random with P(A^ = 1) = k/p, k = 5. The 
process Xq = {x(t)}o<t< n is then generated according to Eq. ©. We solve the regularized least 
square problem (the cost function is given explicitly in Eq. ([8j for the discrete-time case) for different 
values of n, the number of observations, and record if the correct support is recovered for a random 
row r using the optimum value of the parameter A. An estimate of the probability of successful 
recovery is obtained by repeating this experiment. Note that we are estimating here an average 
probability of success over randomly generated matrices. 

The left plot in Fig[T]depicts the probability of success vs. nrj for n — 0.1 and different values of 
p. Each curve is obtained using 2 11 instances, and each instance is generated using a new random 
matrix A . The right plot in FigQ]is the corresponding curve of the sample complexity vs. p where 
sample complexity is defined as the minimum value of nrj with probability of success of 90%. As 
predicted by Theorem |2. li the curve shows the logarithmic scaling of the sample complexity with p. 

In Fig. |2] we turn to the continuous-time model (Q]). Trajectories are generated by discretizing this 
stochastic differential equation with step 5 much smaller than the sampling rate n. We draw random 
matrices ^4° as above and plot the probability of success forp = 16, k = 4 and different values of n, 
as a function of T. We used 2 11 instances for each curve. As predicted by Theorem ll.il for a fixed 
observation interval T, the probability of success converges to some limiting value as rj — > 0. 

3 Discrete-time model: Statement of the results 

Consider a system evolving in discrete time according to the model (O, and let Xq = {x(t)}o<t< n 
be the observed portion of the trajectory. The r th row A® is estimated by solving the following 
convex optimization problem for A r € W 

minimize L(A r ; Xq) + X\\A r \\i , (8) 

where 

-. n — l 

L{A r -x n Q ) = — Y^{x r (t+l)-x r {t)-nA* r x(t)} 2 . (9) 
2n z n 

Apart from an additive constant, the 77 — ^ limit of this cost function can be shown to coincide 
with the cost function in the continuous time case, cf. Eq. OJ. Indeed the proof of Theorem ll.il will 
amount to a more precise version of this statement. Furthermore, L(A r ; Xq) is easily seen to be the 
log-likelihood of A r within model ((6). 
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Figure 2: (right)Probability of success vs. length of the observation interval nrj for different values 
of n. (left) Probability of success vs. 77 for a fixed length of the observation interval, inn = 150) . 
The process is generated for a small value of r\ and sampled at different rates. 



As before, we let S° be the support of row A®, and assume < k. Under the model © x(t) has 
a Gaussian stationary state distribution with covariance Q° determined by the following modified 
Lyapunov equation 

A°Q° + Q°(A )* +r ] A Q°{A () )* +1 = 0. (10) 

It will be clear from the context whether A°/Q° refers to the dynamics/stationary matrix from the 
continuous or discrete time system. We assume conditions (a) and (b) introduced in Section fTTTl and 
adopt the notations already introduced there. We use as a shorthand notation cr max = <r max (I+r) A ) 
where er max (.) is the maximum singular value. Also define D = (l — cr max ) /v ■ We will assume 
D > 0. As in the previous section, we assume the model © is initiated in the stationary state. 

Theorem 3.1. Consider the problem of learning the support S° of row A® from the discrete-time 
trajectory {x(t)} <t< n - If 

then there exists A such that l\-regularized least squares recovers the signed support of A® with 
probability larger than 1 — 6. This is achieved by taking A = -\/(36 \og(4:p/6))/(Da 2 nr]). 

In other words the discrete-time sample complexity, n, is logarithmic in the model dimension, poly- 
nomial in the maximum network degree and inversely proportional to the time spacing between 
samples. The last point is particularly important. It enables us to derive the bound on the continuous- 
time sample complexity as the limit n — > of the discrete-time sample complexity. It also confirms 
our intuition mentioned in the Introduction: although one can produce an arbitrary large number 
of samples by sampling the continuous process with finer resolutions, there is limited amount of 
information that can be harnessed from a given time interval [0, T]. 



4 Proofs 



In the following we denote by X € R nxp the matrix whose (t + l) th column corresponds to the 
configuration x(t), i.e. X = [x(0), x(l), . . . , x(n — 1)]. Further AX g R nxp is the matrix contain- 
ing configuration changes, namely AX = [x(l) — x(0), . . . , x(n) — x(n — 1)]. Finally we write 
W = [w(l), . . . , w(n — 1)] for the matrix containing the Gaussian noise realization. Equivalently, 

W = AX -nAX . 

The r th row of W is denoted by W r . 

In order to lighten the notation, we will omit the reference to Xq in the likelihood function (0 and 
simply write L(A r ). We define its normalized gradient and Hessian by 

G = -VL(A° r ) = —XW; , Q = V 2 L(A°) = -XX* . (12) 
nrj ' n 
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4.1 Discrete time 



In this Section we outline our prove for our main result for discrete-time dynamics, i.e., Theorem 
13.11 We start by stating a set of sufficient conditions for regularized least squares to work. Then we 
present a series of concentration lemmas to be used to prove the validity of these conditions, and 
finally we sketch the outline of the proof. 

As mentioned, the proof strategy, and in particular the following proposition which provides a com- 
pact set of sufficient conditions for the support to be recovered correctly is analogous to the one in 
fl2l . A proof of this proposition can be found in the supplementary material. 

Proposition 4.1. Let a, G m i„ > be be defined by 

A m in(Qsa go) = G m i n , HQ^S ) ,.? (Qs'^.S ) l°° = ^ — a • 

If the following conditions hold then the regularized least square solution ([8]) correctly recover the 
signed support sign(A^): 

l|G|U<^, ||G5o||oo< Ami ^ mi " -A, (14) 

tQ(s°)°,s° - Q( S o)c S o|||oo < i2~7j[ > tQs°,s° ~ Qs°,s4°° ^ 12"^ ■ ( ' 15 - ) 

Further the same statement holds for the continuous model\3\ provided G and Q are the gradient 
and the hessian of the likelihood (0. 

The proof of Theorem 13.11 consists in checking that, under the hypothesis (fTTT l on the number of 
consecutive configurations, conditions ( TBl i to ( fT3b will hold with high probability. Checking these 
conditions can be regarded in turn as concentration-of-measure statements. Indeed, if expectation is 
taken with respect to a stationary trajectory, we have E{G} = 0, E{Q} = Q°. 

4.1.1 Technical lemmas 

In this section we will state the necessary concentration lemmas for proving Theorem 13. II These 
are non-trivial because G, Q are quadratic functions of dependent random variables (the samples 
{x(£)}o<t< rl ) • The proofs of Proposition |472] of Proposition |4.3l and Corollary 14. 4| can be found in 
the supplementary material provided. 

Our first Proposition implies concentration of G around 0. 

Proposition 4.2. Let S C [p] be any set of vertices and e < 1/2. /fcr max = cr max (7 + -q A ) < 1, 
then 

P{||G S || 00 >£} <2|S|e-"( 1 - ff »» )e2/4 . (16) 

We furthermore need to bound the matrix norms as per (15[ in proposition 14.11 First we relate 
bounds on \\Qjs — Q°jsl\oo with bounds on \Qij — Q%\, (i € J, i G S) where J and S are any 
subsets of {1, ...,p}. We have, 

n\lQ.is - Q%)\U >e)< \J\\S\ maxPGQ^ - Q%\ > e/\S\). (17) 
Then, we bound \Qij — Q ^ \ using the following proposition 

Proposition 4.3. Let i,j G {1, cr m ax = o- max (I + rjA a ) < 1, T = r\n > 3/D and < e < 

2/D where D = (1 — cr mal )/t) then, 

HlQij - Q%)\ > e) < 2e-^ {1 - a — fe \ (18) 



Finally, the next corollary follows from Proposition ^. 3| and Eq. ( fT71 ). 

Corollary 4.4. Let J, S (\S\ < k) be any two subsets of {1, ■■■,p} and cr max = cr max (7 + i]A°) < 1, 
e < 2k / D and nr) > 3/D (where D = (1 — o~ max )/r)) then, 

HllQjs - Q°j S \loo >e)< 2\J\ke-^ {1 - a — ^ . (19) 
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4. 1 .2 Outline of the proof of Theorem |37j] 



With these concentration bounds we can now easily prove Theorem 13.11 All we need to do is 
to compute the probability that the conditions given by Proposition 14.11 hold. From the statement 
of the theorem we have that the first two conditions (a,C mm > 0) of Proposition 14. 1 1 hold. In 
order to make the first condition on G imply the second condition on G we assume that Aa/3 < 
(A min C m i n )/(4fc) — A which is guaranteed to hold if 



the probability of the condition on G failing are upper bounded by 8/2 using Proposition 14.21 and 
Corollary |4.4l It is shown in the supplementary material that this is satisfied if condition ( fTTT l holds. 

4.2 Outline of the proof of Theorem O 

To prove Theorem l 1 . 1 | we recall that Proposition ^, ll holds provided the appropriate continuous time 
expressions are used for G and Q, namely 



These are of course random variables. In order to distinguish these from the discrete time version, 
we will adopt the notation G n , Q n for the latter. We claim that these random variables can be 
coupled (i.e. defined on the same probability space) in such a way that G n — > G and Q n — > Q 
almost surely as n — > oo for fixed T. Under assumption (0), it is easy to show that ( fTTT l holds for all 
n > no with no a sufficiently large constant (for a proof see the provided supplementary material). 
Therefore, by the proof of Theorem l3.11 the conditions in Proposition l4.1l hold for gradient G n and 
hessian Q n for any n > no, with probability larger than 1 — 8. But by the claimed convergence 
G n — > G and Q n —> Q, they hold also for G and Q with probability at least 1 — 8 which proves the 
theorem. 

We are left with the task of showing that the discrete and continuous time processes can be coupled 
in such a way that G n — > G and Q n — > Q. With slight abuse of notation, the state of the discrete 
time system (O will be denoted by x(i) where i E IN and the state of continuous time system (UJ by 
x(t) where ieR. We denote by Q° the solution of (0]) and by Q°(rj) the solution of (1101 . It is easy 
to check that Q° (?/) —> Q° as rj — >• by the uniqueness of stationary state distribution. 

The initial state of the continuous time system x(t — 0) is a N(0,Q o ) random variable inde- 
pendent of b{t) and the initial state of the discrete time system is defined to be x(i — 0) = 
(Q°(r])) 1/2 (Q )~ 1/2 x(t = 0). At subsequent times, x(i) and x(t) are assumed are generated by the 
respective dynamical systems using the same matrix A using common randomness provided by the 
standard Brownian motion {&(£)}o<t<T in MP. In order to couple x(t) and x(i), we construct w(i), 
the noise driving the discrete time system, by letting w(i) = (b(Ti/n) — b(T(i — l)/n)). 

The almost sure convergence G n — > G and Q n —> Q follows then from standard convergence of 
random walk to Brownian motion. 
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A < ^4minCmin/<Bfc. 



(20) 



We also combine the two last conditions on Q, thus obtaining the following 





(22) 
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A Learning networks of stochastic differential equations: Supplementary 
materials 



In order to prove Proposition l4.1l we first introduce two technical lemmas. 
Lemma A.l. For any subset S C [p] the following decomposition holds, 



Qsc,s(Qs,s) 1 - Ti + T 2 + T 3 + Q° s a s (Q% s ) 1 



(23) 



where, 



Ti = Q&°,s((Qs,sy 1 -(Qs,s)j, (24) 
T2 = (Qs°,s-Q°sc,s)(Qs,s)~\ (25) 
T 3 = (Q s c, s -Q° s c iS )((Q s , s y 1 -(Q| iS ) _1 V (26) 

(27) 

/« addition, ifl\Q% c 5 (<9s s) 1 III 00 < 1 and A m i n (Qs,s) > C m i„/2 > the following relations 
hold, 

\\T4oo < ^IQzs-QlsU, (28) 

^min 

\mioo < t^-IQsos-Q^Ioo, (29) 

^min 

IT3I00 < ^lQsc,s-QscJioolQs,s-Q s,sloc- (30) 



The following lemma taken from the proofs of Proposition 1 in |fl9l and Proposition 1 in lfl2l 
respectively is the crux to guaranteeing correct signed-support reconstruction of A®. 

Lemma A.2. If Qs°,s° > 0> men me dual vector z from the KKT conditions of the optimization 
problem I© satisfies the following inequality, 

11? 11 <rtn (n Y 1 ill {■> ■ H^ H°° I ■ H%°) c ll°° nu 
\\ z (s°) c \\oc < lQ(s°) a ,s° [ys°,s ) loo I J- H t I H t • (31) 

In addition, if 

11^5° II 00 < A (32) 

f/zen \\A® — Aj-Woo < A m { n /2. The same result holds for problem ©. 

Proof of Proposition l4.lt To guarantee that our estimated support is at least contained in the true 
support we need to impose that \\zga \\oo < 1. To guarantee that we do not introduce extra elements 
in estimating the support and also to determine the correct sign of the solution we need to impose that 

ll^-^Hoo < A min /2. Now notice that since \ mi n(Q s o tS a) = C min the relation A min (Qs° : s ) > 
C m i n /2 is guaranteed as long as |Qsf°,s° — Q°so 50 Hoc < C m in/2. Using Lemma lATTI it is easy to 
see that the bounds of Proposition 14. 1 1 lead to the conditions of Lemma [AT2l being verified. Thus, 
these lead to a correct recovery of the signed structure of □ 

Lemma A.3. Let r, j E [p] and let p(r) represent a p x p matrix with all rows equal to zero 
except the r th row which equals the j row of (I + r/A a ) (the T th power of I + r)A° ). Let 
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R(j) e K,(«+™+i)x(«+™+i) be defined as, 





/ 














o \ 






















R = 


p(m) p(m — 1) 


... p(l) 


P(0) 













p(m + 1) p(m) 


... p(2) 


p(l) 


p(0) .. 


























\ p(m + n — 1) p(m + n — 2) 


■ ■ ■ P{n) 


p(n-l) 


p(n-2) .. 


• P(0) 


o / 


Define R(j) = 1/2{R + R*) and let v 


i denote its i 


!/l eigenvalue and assume 




^max 



■ (33) 



rjA°) < 1. Then, 



p(n+m+l) 

E v * = °' 

i=l 

max | | < 

i 

p(n+m+l) 



E ^ 



< 



1 ^max 

1 n 

2 1 — cr ma . 



(34) 
(35) 
(36) 



froo/ First it is immediate to see that ^p(j*+ m+1 ) Vi = Tr(R) = 0. Let Ji T represent a p x p 
matrix with zeros everywhere and ones in the block-position where p(r) appears and I 2t represent 
a similar matrix but with ones in the block-position where p(r)* appears. Then R can be written as, 

R= g ( E h T ®p{T)+h T ®p{ry j , (37) 

where ® denotes the Kronecker product of matrices. This expression can be used to compute an 
upper bound on | \ . Namely, 



max \Vi\ = (T ma x(i?) < y] 0-max(^lr ® P{t)) < <T max (Ji T )(T max (p(r)) 



r=0 

oo oo 

< ^ CT ^x(p(t)) < 5Z CT 
r=0 t=0 

For the other bound we do, 

(n+m+l) 



r=0 
1 



1 - cr max (<p*) 



r=0 



^ ^ 2 = Tr(i? 2 ) < 1 n 2 ]T Tr(p(r)p(r)*) 

1 

2 n E ii^( 7 

r=0 



1 n 

O / / max — 9 ^ — rr 
r— U 



(38) 
(39) 

(40) 

(41) 

(42) 
□ 



where in the last step we used the fact that < er max < 1- 

Lemma A.4. Let j e \p\. Define p(r) G R lxp tobethej th rowof{I+r]A ) T . Let<f>j e R«x(™+ m ) 
foe defined as, 

( p(m) p(m-l) ... p(l) p(0) ... \ 



p(m + 1) 



p(m) 



... p(2) p(l) 



P(0) 



\ p(m + n — 1) p(m + n — 2) ... p(n) p(n — 1) p(n — 2) 



P(0) 7 



(43) 
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Let vi denote the I th eigenvalue of the matrix R{i,j) = l/2($*$; + G WL^+m)x(n+m) 
(where i £ [p]j and assume cr max = Umaxf-? + < 1 then, 

M < 7T— ; V2 , (44) 

1 (n+m)p 2 2 / 3 1 \ 

n ^ ^ ~ (l-cr max ) 3 \ + 2n 1 - cr max J ' ( * 45 ' ) 



Proof. The first bound can be proved in a trivial manner. In fact, since for any matrix A and f? we 
have cr max (A + < crraax(^) + cr max ( J B) and o- max (AB) < (T max (A)a laax (B) we can write 

max \ n \ = o- max (l/2($5$ i + < l/2(a max ($|$ i ) + a max ($*$,)) (46) 



< cr max ($**i) < 0- m ax($i)cr ma x($j) < 7: r,, (47) 



(l-<7 



2 • 

m aX I 



where in the last inequality we used the fact <7 max ($j) < 1/(1 — <7 max )- The proof of this is just a 
copy of the proof of the bound (l35T l in Lemma lA3l 

Before we prove the second bound let us introduce some notation to differentiate p(r) associated 
with $j from p(r) associated with Let us call them p(r, j) and p(r, i) respectively. Now notice 
that can be written as a block matrix 



A D 
C B 



(48) 



where A, B, C and D are matrix blocks where each block is a p by p matrix. A has p x p blocks, 
B has n X n blocks, C has n X m blocks and D has m x n blocks. If we index the blocks of each 
matrix with the indices x, y these can be described in the following way 

m 

Axy = ^ p(m -x + s, i)*p(m -y + s, j) (49) 

s=l 
n— x 

B xy = ^2p(s,i)*p(s + x-y,j),x>y (50) 

s=0 
n-y 

B xy = ^2p(s + y-x,i)*p(s,j),x<y (51) 

s=0 
n—x 

Cxy = ^2p(s,i)*p{m-y + x + s,j) (52) 

s=0 
n-y 

D xy = p(m- x + y + s,i)*p(s,j). (53) 



With this in mind and denoting by A, B, C and D the symmetrized versions of these same matrices 
(e.g.: A = l/2(A + A*)) we can write, 

(n+m)p 

v? =Tr{R{i,jf) = Tr(A 2 ) + Tr(B 2 ) + 2Tr(CD). (54) 

1=1 

We now compute a bound for each one of the terms. We exemplify in detail the calculation of the 
first bound only. First write, 

rn m 

Tr(i 2 ) = ^^Tr(4,A;). (55) 
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Now notice that each Tr(A xy A xy ) is a sum over t\ , r 2 € [p] of terms of the type, 

(p(m - x + Ti,i)*p(m -y+ n,j) + p(m - x + n,j)*p(m -y + n,i)) x (56) 
x(p(m - y + T 2 ,j)*p(m - x + r 2 , i) + p(m - y + r 2 , i)* p(m — x + r 2 , j)). (57) 

The trace of a matrix of this type can be easily upper bounded by 

yn—x+T 1 +m-y+T 1 +m-y+T2+m—x+T 2 _ ^2(m-x)+2(m—y)+2T 1 +2T 2 (58) 

which finally leads to 

Tr(A 2 ) < - (59) 



Doing a similar thing to the other terms leads to 

(1 - er max ) 

(m-x)+2y+2Ti+2T 2 < 

(1 ^max) 



Tr(B 2 ) < £ £ <^ +2T2+2lx - yl < n _" ,3 (6°) 

I J- CTmax J 



m n m,n,n—y,n — y 

Tr(DC) = E TriP^D^) < £ <£<£-*> +2 « +2T i +2Ta < n , ; - (61) 

1=1 y-l a:,y,ri,T2 



Putting all these together leads to the desired bound. □ 

Proof of Proposition \4.2l We will start by proving that this exact same bound holds when the 
probability of the event {HGslloo > e} is computed with respect to a trajectory {x(t)}™ =0 that is 
initiated at instant t = —m with the value w(—m). In other words, x(—m) = w(—m). Assume 
we have done so. Now notice that as m — > oo, X converges in distribution to n consecutive 
samples from the model (O when this is initiated from stationary state. Since || G$ \\ oo is a continuous 
function of X = [x(0), x(n — 1)], by the Continuous Mapping Theorem, HGsHoo converges 
in distribution to the corresponding random variable in the case when the trajectory {x(i)}f =0 is 
initiated from stationary state. Since the probability bound does not depend on m we have that this 
same bound holds for stationary trajectories too. 

We now prove our claim. Recall that Gj = (XjW*)/(nn). Since X is a linear function of the in- 
dependent gaussian random variables W we can write XjW* = nz*R(j)z, where z € j>p(™+ m + 1 ) 
is a vector of i.i.d. N(0, 1) ran dom variables and R(j) e ]RP(«+™ l + 1 )xp(™+ m + 1 ) j s tne S y mme tric 
matrix defined in Lemma[7Q1 



Now apply the standard Bernstein method. First by union bound we have 

P{||G s ||oo > e} < 2\S\ maxP{z*R(j)z > ne} . 
Next denoting by {^}i<i< p ( n +m+i) tne eigenvalues of R(j), we have, for any 7 > 0, 

p(n-\-m-\-l) 



»{z*R{j)z > ne} = pj ^ 



v i z i > ne > 

i=l 
p(n+m+l) 



( (n+m+l)p \ 



Let 7 = 5(1 — c max )e. Using the bound obtained for | max^ vi\ in Eq. d35l ), Lemma |A~3l \2vci\ < e. 
Now notice that if \x\ < 1/2 then log(l — x) > —x — x 2 . Thus, if we assume e < 1/2 and given 
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that ^(™+ m+1 )p v . — q ( see gq J34I )) we can continue the chain of inequalities, 



^ (n+m+l)p 

\Gs\\°o > e ) < 2|S|maxexp ( ^ vf) j (62) 



< 2|5| exp ^-n(i(l - <w)e 2 - i(l - a max ) 2 e 2 (l - ow) -1 )) (63) 

< 2|5| cxp - <r max )e 2 ) . (64) 
where the second inequality is obtained using the bound in Eq. (f36b . □ 

Proof of Proposition I4.3t The proof is very similar to that of proposition 14.21 We will first show 
that the bound 

mQij-HQij)\ >e)< 2e~^ {1 " Tmaxfe \ (65) 
holds in the case where the probability measure and expectation are taken with respect to trajectories 
{x(i)}2 =0 that started at time instant t = —m with x(—m) = w(—m). Assume we have done so. 
Now notice that as m —> oo, X converges in distribution to n consecutive samples from the model 
|6]when this is initiated from stationary state. In addition, as m — > oo, we have from lemma lA5l that 
E(Qij) — > Qij- Since ; is a continuous function of X = [x(0), x(n — 1)], a simple application 
of the Continuous Mapping Theorem plus the fact that the upper bound is continuous in e leads us 
to conclude that the bound also holds when the system is initiated from stationary state. 

To prove our previous statement first recall the definition of Q and notice that we can write, 

Qij = -z*R(i,j)z, (66) 
n 

where z G R m+n is a vector of i.i.d. N(0, 1) and R(i,j) € R(»+m) * (™+m) is defined has in lemma 
IA.4I Letting v\ denote the I th eigenvalue of the symmetric matrix R(i,j) we can further write, 

(n+m)p 

Qij-E(Q i: j) = ~ v i{*i-l)- (67) 

By Lemma lA~4l we know that, 



i=i 



\n\ < - — -_ — (68) 



( 1 (7 max ) A 



(n+m)p 



1 k -sp \p. < 2 [l | 3 1 ) < 3 (69) 

71 i=\ 1 ~ ( 1_<7 niax) 3 V 2nl-cr max / ~ (l-cr max ) 3 ' 

where we applied T > 3/D in the last line. 

Now we are done since applying Bernstein trick, this time with 7 = 1/8 (1 — <T ma x) 3 e/f), and making 
again use of the fact that log(l — x) > —x — x 2 for |x| < 1/2 we get, 

HQij - E(Q y ) > e) = P( n{zf -l)>en/rf) (70) 

1=1 

< e -=-7E!ir ,p ,+7E!:r lf ^27 ! E!:r^ (72) 

< e 3^ max; , (73) 

where had to assume that e < 2/D in order to apply the bound on log(l — x). An analogous 
reasoning leads us to, 

P(Qi S - E(Qy) < -e) < e "^ (1_CTmax)3e2 (74) 

and the results follows. 
□ 
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Lemma A.5. As before, assume <7 max = cr max (I + r)A ) < 1 and consider that model © was 
initiated at time —m with w(—m), that is, x(—m) = w(—m) then 



l^-^I^^TT^t^- (75) 



Proof. Let p = I + r]A°. Since, 

oo 



1=0 



and 



we can write, 



n+m— 1 



HQij)=V E m+ , n 1 - (P 1 P*%, (77) 
^-^ n + m 

1=0 



n+m— 1 



Q?.-E(Q ij ) = 77 E E ^—(P l P* l h)- (78) 

\/— m+n / — 1 / 

Using the fact that for any matrix A and i? max i:) (^4^- ) < o- ma ^(A), o" max (^4_B) < 
<rma,x{A)a max (B) and <7 max (A + -B) < cr max (A) + cr max (B) and introducing the notation ( = p 2 
we can write, 

(/-n+m /- m +n — 2 \ M //-2 _i_ /-n+m o/*m+n+l\ 

f-F + -r- E ^ = — , vi n 2 (79) 
1 — C n + m / (m + n.)(l — (,)^ 

~ (m + n)(l - ff mai ) 2 ' ^ 

where we used the fact that for £ G [0, 1] and rt e IN we have 1 - C > 1 - VC and C 2 + C™ - 2 C 1+ " < 
1. □ 

Proof of Theorem l3.lt 

In order to prove Theorem 13.11 we need to compute the probability that the conditions given by 
Proposition 14.11 hold. From the statement of the theorem we have that the first two conditions 
(a, C m i n > 0) of Proposition 14. II hold. In order to make the first condition on G imply the second 
condition on G we assume that 

Att ^.minCmin * /o i \ 

T-^fc A (81) 

which is guaranteed to hold if 

A < ^min C min /8k. (82) 
We also combine the two last conditions on Q to 

IIIQm.so - Q[ P ],s4oo < j^-jf- ( 83 ) 

Where [p] = S° U (S°) c . We then impose that both the probability of the condition on Q failing 
and the probability of the condition on G failing are upper bounded by 8/2. Using Proposition 14.21 
we see that the condition on G fails with probability smaller than 8/2 given that the following is 
satisfied 

A 2 = ma^tnnD)- 1 log(4p/5). (84) 

But we also want d82l to be satisfied and so substituting A from the previous expression in (l82l we 
conclude that n must satisfy 

n > 2304k 2 C min ~ 2 A min - 2 a- 2 (Dj 1 )- 1 log(4p/<5). (85) 



15 



In addition, the application of the probability bound in Proposition l4.2l requires that 

X 2 a 2 

< 1/4 (86) 



9 

so we need to impose further that, 

n > 16(Z?77)- 1 log(4p/(5). (87) 
To use Corollary 14. 4l for computing the probability that the condition on Q holds we need, 

nri > 3/D, (88) 

and 

< 2kD-\ (89) 

12Vfc 

The last expression imposes the following conditions on k, 

k 3/2 > 2A- 1 aC min D. (90) 
The probability of the condition on Q will be upper bounded by 8/2 if 

n > 46087r 1 fc 3 a~ 2 C min " 2 L>- 3 log4pfc/(5. (91) 

The restriction d90l l on k looks unfortunate but since k > 1 we can actually show it always holds. 
Just notice a < 1 and that 

< <W(Q°) < T - ~ ~~ & D < a-UQs°,s°) (92) 

-t f max 

therefore C m ; n Z? < cr m i n (Q^ s o)/cmax(Qso 50) < 1. This last expression also allows us to 
simplify the four restrictions on n into a single one that dominates them. In fact, since C min D < 1 
we also have C m f n D~ 2 > C^^D -1 > 1 and this allows us to conclude that the only two conditions 
on n that we actually need to impose are the one at Equations d85l l. and ( |9TT >. A little more of algebra 
shows that these two inequalities are satisfied if 

™? > 1Qik2{k ° B L + KL) log(4pfe/^)- (93) 



This conclude the proof of Theorem |3.1| 
□ 

Lemma A.6. Let cr max ee a max (I + r/A ) and p min (A°) = -A max ((A° + (A°)*)/2) > then, 

-A mi „ > hmsup , (94) 

V 2 / 1^0 V 

l immf > _ Aroax M° + (^°n . (95) 

^0 77 \ 2 J 



Proof. 



l-\UL((I + r,A )*(I + r,A°)) 
V 

1 - XllLjl + + {A )*) + r7 2 (A°)M ) 

1 - (1 + tju^A + (A )* + y?^ )^ )^) 1 / 2 
V 



(96) 
(97) 
(98) 



where u is some unit vector that depends on 77. Thus, since y/1 + x = 1 + x/2 + 0(x 2 ), 

1-^max . / j4° + (A°)*\ /A + (A°)*\ 

limmf = - hmsup it it > -A max . (99) 

V-+0 77 7j— j-0 V 2 / V 2 / 

The other inequality is proved in a similar way. □ 
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Proof of Theorem|27D 

In order to prove Theorem |2.1| we first state and prove the following lemma, 

Lemma A.7. Let G be a simple connected graph of vertex degree bounded above by k. Let A 
be its adjacency matrix and A = —hl + A with h > k then for this A the system in (Q]) has 
Q° = -(l/2)(A a )- 1 and, 

lQ {S a ) c ,S°(Qs ,S ) 1 llloo = |||04(S°) C ,(S°) C ) ^(s°) ,s°lll°° — V 1 - (100) 



Proof. A is symmetric so A is symmetric. Since A is irreducible and non-negative, Perron- 
Frobenious theorem tells that A max (A) < k and consequently A max (A°) < —h + A max (A) < 
— h + k. Thus h > k implies that A° is negative definite and using equation © we can compute 
Q° = —(1/2)(A°)~ 1 . Now notice that, by the block matrix inverse formula, we have 

(Qso^y 1 = -zc-\ (ioi) 

C), (102) 

where C = Ag ^ — Ag (g * c (A,g s C ^g \c) 1 ^(s ) c ,s° ar, dthus 

l\Q(S°) c ,S° (Q S°,S") loo — l{^(S°) a ,{S°) a ) ^(S°) C ,S° loo- (103) 
Recall the definition of \\B\\ 

I s loo =maxV|Ly 

I * ' 

3 

Let z = h~ x and write, 

oo 

{A\ S o )C ,(so)c) _1 = -z(I - zl (s o ) c i(s o ) c) _1 = -z ^2(zA {SO) c d so)c) n 7 (105) 

J ^is°) c .s° — z ~ lz A(s°)c ,s°- (106) 

This allows us to conclude that |(j4-9go)o ^s ) ^ 1 ^is°) c s°\l°° * s m ^ act trie max i mum over all 
path generating functions of paths starting from a node i E (S°) c and hitting S° for a first time. 
Let f2j denote this set of paths, lu a general path in G and \uj\ its length. Let hi, denote the 
degree of each vertex visited by ui and note that k m < k, Vm. Then each of these path generating 
functions can be written in the following form, 

E zH ^ E y-^T— =E G ((fcz) T ^°), (107) 
ceo, fcl " fc M 

where Tj go is the first hitting time of the set S° by a random walk that starts at node i £ S° C and 
moves with equal probability to each neighboring node. But T,.so > 1 and kz < 1 so the previous 
expression is upper bounded by kz. □ 



|S| 00 =maxV|B y |. (104) 



Now what remains to complete the proof of Theorem 12.11 is to compute the quantities a, A m j n , 
Pmin(^4°) and C m i n in Theorem l 1.11 . From Lemma lATTl we know that a = 1 — k/(k + m). Clearly, 
A m i n = 1. We also have that p m i n (^4°) = cr m i n (A°) > k + m — er max (A) > m + k — k = rn. 
Finally, 

WQ^o) = |w-(^V) = Ij^z^ > k^^i >- ( 108 > 

where in the last step we made use of the fact that m + k > k. Substituting these values in the 
inequality from Theorem l 1.1 1 gives the desired result. 

□ 
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